Abstract
Introduction. In recent decades, knowledge about DNA has been increasingly used to solve non-biological problems
(DNA computing, long-term storage of information). These are primarily cases in which artificial nucleotide
sequences must be selected, and special programs are used to create them. However, existing generators do not take
into account the physicochemical properties of DNA and cannot produce sequences with a pronounced “non-biological”
structure; in effect, they distribute nucleotides at random. The objective of this work is to create a generator of
quasirandom sequences with a special nucleotide structure. It should take into account some physicochemical features
of nucleotide structures and is intended for storing non-biological information in DNA.

Materials and Methods. New software, GATCGGenerator, for generating quasirandom nucleotide sequences is described.
It is provided as SaaS (software as a service), which makes it available from various devices and platforms. The
program generates sequences of a specified structure, taking into account the guanine-cytosine (GC) composition and
the dinucleotide content. The operation of the program algorithm is presented. The requirements for the generated
nucleotide sequences were set through a Telegram chat, and the interaction with the user is illustrated. The
differences between the input parameters and the nucleotide structures actually produced by the program were
determined and summarized. Generation times for different input data are compared. Short sequences differing in
type, length, GC composition, and dinucleotide content were studied, and a table shows how the input and output
parameters relate in this case.

Results. The developed software was compared with existing nucleotide sequence generators. The generated sequences
were found to differ in structure from known DNA sequences of living organisms, so they can be used as auxiliary or
masking oligonucleotides suitable for molecular biological manipulations (e.g., amplification reactions) and for
storing non-biological information (images, texts, etc.) in DNA molecules. The proposed solution forms specific
sequences from 20 to 5,000 nucleotides long with a given number of dinucleotides and without homopolymer fragments.
More stringent generation conditions remove known limitations and yield quasirandom nucleotide sequences that
satisfy the specified input parameters. In addition to the number and length of sequences, the GC composition, the
dinucleotide content, and the type of nucleic acid (DNA or RNA) can be set in advance. Examples of short sequences
differing in length, GC composition, and dinucleotide content are given. The obtained 30-nucleotide sequences were
tested, and none showed 100 % homology with known DNA sequences of living organisms. The closest matches spanned
25 nucleotides of a generated sequence (similarity of about 80 %). This indicates that GATCGGenerator can generate
non-biological nucleotide sequences with high efficiency.
Discussion and Conclusion. The new generator creates nucleotide sequences in silico with a given GC composition.
It also excludes homopolymer fragments, which improves the physicochemical stability of the sequences.


Keywords: GATCGGenerator, nucleotide sequences generator, synthetic nucleic acids, random sequences, data storage 
in DNA, steganography, NYRN-oligonucleotides, calculations with DNA, cryptography, DNA-tagging in hydrology 
Introduction. DNA is a unique biopolymer that provides storage, transmission and reproduction of genetic
information in living organisms. DNA molecules consist of four types of nucleotides containing nitrogenous 
bases: adenine (A), guanine (G), cytosine (C), thymine (T). Their possible combinations provide nucleotide 
sequences forming functional genetic elements. In molecular biology and genetics, the basic investigations are 
carried out on nucleotide sequences of living organisms, but there is an increasing need to create artificial 
sequences, especially, when solving non-biological tasks (e.g., DNA calculations [1, 2], storage in DNA [3], 
cryptography [4], DNA tags in hydrology [5], etc.). 

It is expected that by the end of 2040, the volume of information will reach several yottabytes (10^24 bytes), all
of which has to be structured and stored. Both processes significantly affect the consumption of energy resources,
as well as the production of storage and peripheral devices (hard drives, solid-state drives). Storing such an
amount of information would require more than 10^9 kg of extra pure silicon [6], which may not be available. A
possible solution is to use the principles of DNA to handle large volumes of data.

Nucleotide sequences are easily digitized by assigning binary codes to individual nucleotides [7-11] or to blocks
of nucleotides [12-14]; therefore, text, graphic or multimedia files can be converted into nucleotide sequences
[15-18]. Artificial nucleotide sequences can be composed manually or generated with special software (DNA
generators), depending on the task. Some DNA generators were developed as independent applications, others as part
of software packages designed to solve general [19] (see footnotes 1-5) or specific tasks [20]. As a rule, DNA
generators are based on combinatorial approaches and produce random sequences of a given length and
guanine-cytosine (GC) composition. However, such software does not take into account the chemical properties of
nucleotides and cannot produce sequences with a particular structure (e.g., without homopolymer sites or long
repeating motifs). Therefore, the sequences created by such generators cannot always be reproduced in the
laboratory. Moreover, such sequences may be identical to DNA fragments existing in nature, which introduces
ambiguity when encoding information of a non-biological nature.
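The digitization idea mentioned above can be illustrated with a minimal sketch. The 2-bit mapping used here (00 = A, 01 = C, 10 = G, 11 = T) and the function names `encode` and `decode` are assumptions chosen for illustration only; the cited works each define their own codes.

```python
# Hypothetical 2-bit code; not the mapping of any particular cited work.
BITS_TO_NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}


def encode(data: bytes) -> str:
    """Convert bytes to a nucleotide string, two bits per nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_NT[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(seq: str) -> bytes:
    """Convert a nucleotide string produced by encode() back to bytes."""
    bits = "".join(NT_TO_BITS[nt] for nt in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    sequence = encode(b"DNA")
    print(sequence, decode(sequence))
```

A naive code of this kind places no constraint on GC composition or homopolymer runs, which is exactly the limitation of unconstrained sequences discussed above.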

The presented work is aimed at creating a generator of nucleotide sequences of a special structure that can be used 
when encoding text, graphic and other information in DNA molecules. 

Materials and Methods. The criteria that should be kept in mind when creating sequences were defined. The need 
to vary the GC composition, set a certain number of dinucleotides, and exclude homopolymer sites in sequences was 
taken into account. 


The authors developed the GATCGGenerator program in Python 3.6 (Anaconda distribution; footnote 6). To create a
Telegram bot (footnote 7), Numpy 1.19 [21] and the Python GATCGGenerator library were used. The solution was
provided as SaaS (software as a service), which made it accessible from different devices and platforms.

Input parameters included the number of sequences, their length, GC composition, and dinucleotide content. The
generator excluded two-nucleotide repeats that occurred more than four times. The result was presented as a CSV
file containing each sequence, its GC composition, and the counts of all nucleotides.
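The CSV output described above could look roughly like the following sketch. The column names and the helper `write_csv` are hypothetical; the source states only that the file holds the sequence, its GC composition, and the nucleotide counts.

```python
import csv
import io


def write_csv(sequences, handle):
    """Write one row per sequence: sequence, GC %, and per-nucleotide counts."""
    writer = csv.writer(handle)
    # Column names are assumed for illustration.
    writer.writerow(["sequence", "GC_percent", "A", "C", "G", "T"])
    for seq in sequences:
        gc = 100 * (seq.count("G") + seq.count("C")) / len(seq)
        writer.writerow([seq, f"{gc:.1f}"] + [seq.count(n) for n in "ACGT"])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(["GATC", "GGCCAT"], buffer)
    print(buffer.getvalue())
```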

Repeats and homopolymer fragments were stored as a separate list. First, a sequence was generated at random from
the four nucleotides (random.choice(nuc), where nuc = 'ACGT'). Then the sequence was searched for repeats. If it
contained at least one item from the list, a new random sequence was generated. Next, the GC and NN compositions
were calculated. If the NN composition fell outside the user-defined range, a paired nucleotide was replaced at
random and the GC composition was recalculated. If the sequence matched the input parameters, it was added to the
set of sequences.
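The steps just described can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it uses plain rejection sampling instead of replacing paired nucleotides, it treats three identical nucleotides in a row as a homopolymer (an assumed cut-off), it does not filter longer repeating motifs, and the names `generate`, `gc_percent`, `nn_percent`, and `to_rna` are hypothetical.

```python
import random
import re

NUC = "ACGT"
PAIR = re.compile(r"(.)\1")    # identical neighbours: AA, CC, GG, TT
RUN3 = re.compile(r"(.)\1\1")  # three or more identical nucleotides (assumed cut-off)


def gc_percent(seq):
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def nn_percent(seq):
    # Share of nucleotides that belong to identical pairs.
    return 100 * 2 * len(PAIR.findall(seq)) / len(seq)


def generate(length, gc_range, nn_range, rng=random, max_tries=1_000_000):
    """Draw random sequences until one satisfies all constraints."""
    for _ in range(max_tries):
        seq = "".join(rng.choice(NUC) for _ in range(length))
        if RUN3.search(seq):
            continue  # homopolymer site found, draw again
        if not gc_range[0] <= gc_percent(seq) <= gc_range[1]:
            continue
        if not nn_range[0] <= nn_percent(seq) <= nn_range[1]:
            continue
        return seq
    raise RuntimeError("no sequence found; try wider ranges")


def to_rna(seq):
    return seq.replace("T", "U")


if __name__ == "__main__":
    rng = random.Random(2023)
    for _ in range(3):
        print(generate(30, (50, 60), (10, 30), rng))
```

Rejection sampling is easy to reason about but slows down sharply for narrow ranges, which is presumably why the program repairs a candidate by replacing paired nucleotides rather than discarding it.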

The program algorithm is given below.

Notation: Type is the nucleic acid type (DNA or RNA); GCmin and GCmax bound the permitted GC content, %; NNmin and
NNmax bound the permitted dinucleotide content NN, %; N is the number of sequences requested; S is a sequence;
l is the sequence length; count is the number of sequences accepted so far.

Pseudocode
Start
Input (Type, GC, NN, N, l)
rep.list = list of repeating motifs and homopolymer sites
count = 0
sequences = set()
Step 1. IF count < N
    S = random sequence of length l
    IF any rep.list(k) is contained in S
        Return to step 1.
    ELSE
        NN = len(DL_REGEX.findall(''.join(S)))
        NN_perc = (NN x 2 / l) x 100
        IF NNmin <= NN_perc <= NNmax
            GC = (S.count('G') + S.count('C')) / l x 100
            IF GCmin <= GC <= GCmax
                IF Type == DNA
                    Step 2.
                    A_perc = S.count('A') / l x 100
                    G_perc = S.count('G') / l x 100
                    C_perc = S.count('C') / l x 100
                    T_perc = S.count('T') / l x 100
                    U_perc = S.count('U') / l x 100
                    count = count + 1
                    sequences.add(S)
                ELSE
                    S = S.replace('T', 'U')
                    (Step 2.)
            ELSE
                Return to step 1.
        ELSE
            Random replacement of the second repeated character,
            GC = (S.count('G') + S.count('C')) / l x 100
Output Sequences: (S, GC%, NN%, A%, G%, C%, T/U%)
End

1 Nucleotide Sequence Generator. nucleotide-generator.herokuapp.com. URL: https://nucleotide-generator.herokuapp.com/ (accessed: 01.12.2022).
2 DNA Sequence Tools: Random Sequence Generator. molbiotools.com. URL: http://www.molbiotools.com/randomsequencegenerator.html (accessed: 01.12.2022).
3 Random DNA Sequence Generator. faculty.ucr.edu. URL: http://www.faculty.ucr.edu/~mmaduro/random.htm (accessed: 02.12.2022).
4 Random DNA Sequence GenScript. genscript.com. URL: https://www.genscript.com/sms2/random_dna.html (accessed: 04.12.2022).
5 Random DNA Generator. Computer software. URL: http://54.235.254.95/cgi-bin/gd/gdRandDNA.cgi (accessed: 04.12.2022).
6 Anaconda. Anaconda Inc. anaconda.com. URL: https://www.anaconda.com/ (accessed: 20.01.2023).
7 Python telegram bot. github.com. URL: https://github.com/python-telegram-bot/python-telegram-bot (accessed: 01.12.2022).



The requirements for the generated nucleotide sequences were set using Telegram chat. An example of user 
interaction is shown in Figure 1. 


[Figure 1: screenshot of a Telegram chat with the GATCGGenerator bot, described as "Online generator of DNA/RNA
sequences with specified GC and dinucleotides content". After "start", the bot asks in turn for the generator type
(user: DNA), the number of sequences to generate (50), the sequence length (50), the GC content in % in the format
<min>-<max> (50-60), and the dinucleotide content in % in the format <min>-<max>, with 0-0 if dinucleotides are
not mandatory (19-20). The bot then replies: "Your job is queued. Please, wait for the result."]

Fig. 1. Example of a user chat in Telegram
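The `<min>-<max>` input format requested by the bot can be validated with a few lines of code. The sketch below is an assumption about how such input might be checked; the function name `parse_range` and the rule that both bounds lie between 0 and 100 are illustrative and are not taken from the program itself.

```python
def parse_range(text):
    """Parse a '<min>-<max>' percentage range such as '50-60' or '0-0'."""
    low_text, sep, high_text = text.strip().partition("-")
    if not sep:
        raise ValueError("expected the format <min>-<max>")
    low, high = float(low_text), float(high_text)
    if not 0 <= low <= high <= 100:
        raise ValueError("expected 0 <= min <= max <= 100")
    return low, high


if __name__ == "__main__":
    for reply in ("50-60", "19-20", "0-0"):
        print(reply, parse_range(reply))
```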
In this work, the functionality of GATCGGenerator was compared with that of existing random sequence generators.
The differences between the input parameters and the specific nucleotide structures produced by each program were
determined (Table 1).


Table 1
Comparison of GATCGGenerator functionality to other nucleotide sequence generators 
DNA 
. Ri 
Nucleotide eeuuenee endon Random Random 
GATCGGenerator Tools: DNA 
Sequence DNA DNA 
[20] : Random Sequence i io 
Generator 19 | Sequence Generator 
Sequence Generator 
Generator? 
Maxi length 
ee 5,000 1,000,000 10,000 1,000 
(nucleotides) 
1; 10; 50; 
Number of sequences 100 1 100 100 
Input GC composition " é — + (*) 
(%) 
GC composition (%) number — number 
Input NN interval 
composition (%) — 
No homopolymer 2 
sites 
DNA/RNA / 
Sequence type DNA/RNA DNA ie : DNA 
Protein 
Output of results CSV file Text on the screen 
(*) User enters AT composition 


GATCGGenerator has broader functionality: it allows the user to specify the number of dinucleotides and to create
sequences without extended homopolymer sites and repeats, which affect the success of an experiment. The existing
generators only allow the GC composition to be varied.

The program created by the authors of this research generates a given number of quasi-random sequences of 
nucleotides that do not have homology with natural DNA, but are suitable for molecular biological manipulations. 

Research Results. GATCGGenerator generates specific DNA or RNA sequences from 20 to 5,000 nucleotides long that
contain a given number of dinucleotides and no homopolymer sites (no more than two identical nucleotides side by
side). More stringent generation conditions can make the search for suitable sequences take much longer, for
example, when the permitted ranges of GC and dinucleotide content are narrow (e.g., GC composition 45-50 % and NN
composition 10-20 %). The running time of the program for various input data is shown in Table 2.


Table 2
Sequence generation time for different inputs

Data inputs                              | Time (s)
Length  | Number | GC, %  | NN, %  |
20      | 10     | 50-60  | 20-50  | 3.45
30      | 10     | 50-60  | 20-50  | 3.91
20      | 10     | 50-60  | 40-50  | 9.74
30      | 10     | 50-60  | 40-50  | 9.53
30      | 10     | 40-50  | 20-20  | 8.80
1,000   | 100    | 45-50  | 40-50  | 11.49
2,000   | 100    | 45-50  | 10-20  | 240.25
5,000   | 100    | 50-60  | 20-50  | 11.57
GATCGGenerator, through more stringent sequence generation conditions, removes the limitations of known DNA
generators and creates quasi-random sequences of nucleotides depending on the indicated input parameters. You can 
specify the required number of sequences, their length, GC composition and dinucleotide content, as well as the nature 
of the nucleic acid (DNA or RNA). Specifically, sequences created by GATCGGenerator can be used in DNA 
steganography, applied to protect and transmit information through hiding the message content in the nucleotide 
sequence [3]. 

The proposed software solution (GATCGGenerator) produces a set of quasi-random nucleotide sequences according to
user-defined input parameters (type of nucleic acid, sequence length, GC and dinucleotide composition).
GATCGGenerator excludes nucleotide repeats and homopolymer sites longer than three elements. The generated
sequences can be used as service or masking sequences (e.g., in DNA steganography) and are suitable for any
non-biological enzymatic manipulations. Numerous artificial nucleotide sequences can be generated and used to
create a universal oligotheca suitable for repeated encoding of non-biological data and their long-term storage.

The data presented in Table 3 summarizes the results of the program. For a certain type of nucleic acid (in this case, 
DNA), the following data is shown: the content of dinucleotides (NN %), the number of generated sequences, their 
length (nucleotides — nt), and GC composition. 


Table 3
Examples of short sequences differing in length, GC composition, and dinucleotide content (%)

Input parameters                          | Output parameters
Type | Number | Length, nt | GC, % | NN, % | Nucleotide sequence, 5'-3'** | Length, nt | GC, % | NN, %*
DNA  | 5      | 30         |       | 20    | CTGGTATATCGGAATCATATCGCGCAGTGT | 30 | 46.7 | 20.0
     |        |            |       |       | AATCAGCTAGTAGGACGCAGTAGTGAATCA | 30 | 43.3 | 20.0
     |        |            |       |       | GAATGTAGTCCTAGGCACATACTACGTAGC | 30 | 46.7 | 20.0
     |        |            |       |       | AGTTGCACTGAAGTCTATGATCTGGCATGC | 30 | 46.7 | 20.0
     |        |            |       |       | GACACACTACTATGGACGTGAGGCACTTAC | 30 | 50.0 | 20.0
DNA  | 5      | 30         | 51-60 | 20    | TCAGCTCAGCGCCAATCGAGCTTATAGTGC | 30 | 53.3 | 20.0
     |        |            |       |       | GAGGCTATCGTCAAGCATAGACCGTGTGCT | 30 | 53.3 | 20.0
     |        |            |       |       | GACTCAGTAGCTGCTCCGGACATACAGCCT | 30 | 56.7 | 20.0
     |        |            |       |       | TCGCGCGTTAGACTTAGGTCTCATCGCAGC | 30 | 56.7 | 20.0
     |        |            |       |       | ACGCTCACAGGAGTTCGCATCGAACGATGC | 30 | 56.7 | 20.0
DNA  | 5      | 30         |       | 0     | ACGACAGTGATATAGCACGACGTGCTCATA | 30 | 46.7 | 0.0
     |        |            |       |       | GACTACATCTGATAGTACACGTGCTGCACT | 30 | 46.7 | 0.0
     |        |            |       |       | TCTATCTCTGCTAGAGCGCTCGTCACTCTA | 30 | 50.0 | 0.0
     |        |            |       |       | TCTGATCTACTATAGCGATACGTGAGAGTG | 30 | 43.3 | 0.0
     |        |            |       |       | ACACATATATCGACGCACGCGTCGTAGTAC | 30 | 50.0 | 0.0
DNA  | 5      | 50         | 41-60 | 20    | TGCATGACCATGCTTGCGGTAGACATTCAGACGCGCGAATAGTAGGACGA | 50 | 52.0 | 20.0
     |        |            |       |       | GCATACGAGTGGCATACATATTAGACTATACGGTAGTGCATATGGTGCAA | 50 | 42.0 | 20.0
     |        |            |       |       | CTGAGACTCCTCTCTGTGGAGCTCCTAGTACCGTCACGCGTGCTCTGAAG | 50 |      |
     |        |            |       |       | CTGTGTGAACATACGATGCATTCTCATCTCGGTATGGCTGAAGTGCACAT | 50 | 46.0 | 20.0
     |        |            |       |       | GCGCTGACGTCATGGTTCATACCAATGTAGCATGATGTGCGATAGGCACA | 50 |      |

* NN shows the fraction (%) of the dinucleotides contained in the nucleotide sequence.

** Dinucleotides are highlighted in bold.
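The GC and NN values reported in Table 3 can be recomputed directly from the sequences. The short check below is an illustration under one assumption: NN % is taken as twice the number of identical neighbouring pairs divided by the sequence length, as in the pseudocode formula NN_perc = (NN x 2 / l) x 100. The function names are hypothetical.

```python
import re

PAIR = re.compile(r"(.)\1")  # AA, CC, GG or TT


def gc_percent(seq):
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def nn_percent(seq):
    # Each identical pair contributes two nucleotides to the NN fraction.
    return 100 * 2 * len(PAIR.findall(seq)) / len(seq)


if __name__ == "__main__":
    # Two 30-nt rows taken from Table 3.
    for seq in ("CTGGTATATCGGAATCATATCGCGCAGTGT",
                "ACGACAGTGATATAGCACGACGTGCTCATA"):
        print(len(seq), round(gc_percent(seq), 1), round(nn_percent(seq), 1))
```

For the first sequence the three pairs GG, GG and AA give 3 x 2 / 30 = 20 %, which agrees with the tabulated value.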


The obtained 30-nucleotide sequences were tested using the NCBI BLAST tool. None showed 100 % homology with known
DNA sequences of living organisms. The closest matches spanned 25 nucleotides of a generated sequence (similarity
of about 80 %). This indicates that GATCGGenerator can generate non-biological nucleotide sequences with high
efficiency. It can be assumed that sequences generated in this way do not coincide completely with nucleotide
fragments of living organisms.
In this case, special DNA-oligonucleotides of artificial origin containing informative and service parts can be used
as a convenient information carrier. Recently, the authors of this work have proposed the use of NYRN- 
oligonucleotides [14] consisting of: 

— internal part (YR)n encoding the encrypted information; 

— service (auxiliary) parts S1 and S2 flanking sequence (YR)n (Fig. 2). 


5' (N)k (YR)n (N)m 3'
service part S1 | encoding part | service part S2

Fig. 2. Structure of NYRN-oligonucleotides: N, degenerate nucleotides; Y, pyrimidines (C or T); R, purines (A or G);
k, n, m, indices corresponding to the length of each part


The length of the sites (n, k and m) may vary, but the structure of the service parts should provide the 
successful course of amplification reactions (length more than 18 nt, 40-60 % GC composition, absence of 
homopolymer sites and repeats). GATCGGenerator allows including NN dinucleotides containing identical paired 
nucleotides (e.g., AA, GG, CC, TT or UU for RNA), which can increase the specificity of molecular hybridization 
of nucleic acids. 
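The requirements for service parts listed above (length over 18 nt, 40-60 % GC, no homopolymer sites or repeats) can be expressed as a simple check. This is a sketch under stated assumptions: a homopolymer is taken to be three or more identical nucleotides in a row, a repeat is taken to be a two-nucleotide motif occurring five or more times in succession, and the function name `is_valid_service_part` is hypothetical.

```python
import re

HOMOPOLYMER = re.compile(r"(.)\1\1")       # assumed: 3+ identical nucleotides
MOTIF_REPEAT = re.compile(r"(..)\1{4,}")   # assumed: 2-nt motif 5+ times in a row


def is_valid_service_part(seq):
    """Check a candidate service part against the stated criteria."""
    if len(seq) <= 18:
        return False
    gc = 100 * (seq.count("G") + seq.count("C")) / len(seq)
    if not 40 <= gc <= 60:
        return False
    return not HOMOPOLYMER.search(seq) and not MOTIF_REPEAT.search(seq)


if __name__ == "__main__":
    print(is_valid_service_part("GAATGTAGTCCTAGGCACATACTACGTAGC"))
    print(is_valid_service_part("ACGTACGTACGT"))
```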

Discussion and Conclusion. A software solution (GATCGGenerator) has been proposed that, in comparison to
traditional approaches, applies more stringent conditions when generating sequences. This removes the limitations
of known DNA generators and yields quasi-random nucleotide sequences that depend on the specified input
parameters. The obtained 30-nucleotide sequences were tested, and none showed 100 % homology with known DNA
sequences of living organisms. The closest matches spanned 25 nucleotides of a generated sequence (about 80 %
similarity).

Note also that, to hide information in NYRN-oligonucleotides, they must be mixed with masking DNA. Masking
sequences should resemble the sequences of NYRN-oligonucleotides, so that anyone trying to read the hidden
information cannot recognize them without the key sequences. The addressee should know the key sequences, that
is, primers to the service sites of the NYRN-oligonucleotides. The addressee can decipher the transmitted message
by isolating the informative nucleotide sequences using a polymerase chain reaction followed by sequencing. A set
of NYRN- and masking oligonucleotides can be easily obtained using GATCGGenerator, synthesized, and then stored as
an oligotheca. To do this, it is enough to determine the optimal NYRN-oligonucleotides and then fill the
oligotheca. In the future, laboratory experiments are planned to test the proposed method of storing
non-biological information and to check the viability of an oligotheca obtained using the generator.
chemical methods of biopolymer analysis, Institute of Biochemistry and Genetics, a separate structural subdivision
of the Federal State Budgetary Scientific Institution
Aleksey Viktorovich Chemeris, Dr.Sci. (Biology), Professor, Chief Researcher, Institute of Biochemistry and
Genetics, a separate structural subdivision of the Federal
Claimed contributorship:

O.Yu. Kiryanova: software development, text preparation, calculations, formulation of conclusions.

R.R. Garafutdinov: consulting on the subject area, software testing, revision of the text, correction of the
conclusions.

I.M. Gubaydullin: academic advising, correction of the conclusions, revision of the text of the article.

A.V. Chemeris: formulation of the basic concept, goals and objectives of the study, analysis of the research
results, revision of the text, correction of the conclusions.

Conflict of interest statement: the authors declare no conflict of interest.

All authors have read and approved the final manuscript.


